Random Indexing K-tree
Random Indexing (RI) K-tree is the combination of two algorithms for
clustering. Many large scale problems exist in document clustering. RI K-tree
scales well with large inputs due to its low complexity. It also exhibits
features that are useful for managing a changing collection. Furthermore, it
solves previous issues with sparse document vectors when using K-tree. The
algorithms and data structures are defined, explained and motivated. Specific
modifications to K-tree are made for use with RI. Experiments have been
executed to measure quality. The results indicate that RI K-tree improves
document cluster quality over the original K-tree algorithm.
Comment: 8 pages, ADCS 2009.
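As a rough illustration of the Random Indexing half of this combination, the sketch below projects a sparse bag-of-words document into a dense, fixed-length vector by summing sparse ternary index vectors, one per term. The dimension, the number of non-zero entries, and the per-term seeding are assumptions chosen for illustration; they are not the parameters or the implementation used in the paper.

```python
import random


def index_vector(term, dim=64, nonzero=4):
    """Return a sparse ternary index vector (+1/-1/0) for a term.

    Seeding the generator with the term itself is an illustrative way to
    make the vector reproducible without storing it.
    """
    rng = random.Random(term)
    vec = [0] * dim
    positions = rng.sample(range(dim), nonzero)
    for i, pos in enumerate(positions):
        # Half of the non-zero entries are +1, the other half -1.
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec


def document_vector(term_counts, dim=64, nonzero=4):
    """Sum the index vectors of a document's terms, weighted by term count."""
    doc = [0.0] * dim
    for term, count in term_counts.items():
        for i, value in enumerate(index_vector(term, dim, nonzero)):
            if value:
                doc[i] += count * value
    return doc


if __name__ == "__main__":
    vec = document_vector({"random": 2, "indexing": 1, "tree": 3})
    print(len(vec), sum(1 for x in vec if x != 0.0))
```

The resulting vectors are dense and of fixed length regardless of vocabulary size, which is the property that makes them convenient inputs for a tree-structured clustering algorithm.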
Workflows, processes and technical solutions for seeding the research data commons
Queensland University of Technology (QUT) completed an Australian National Data Service (ANDS) funded “Seeding the Commons Project” to contribute metadata to Research Data Australia. The project employed two Research Data Librarians from October 2009 through to July 2010. Technical support for the project was provided by QUT’s High Performance Computing and Research Support Specialists.
The project identified and described QUT’s category 1 (ARC / NHMRC) research datasets. Metadata for the research datasets was stored in QUT’s Research Data Repository (Arcitecta Mediaflux). Metadata that was suitable for inclusion in Research Data Australia was made available to the Australian Research Data Commons (ARDC) in RIF-CS format.
Several workflows and processes were developed during the project. 195 data interviews took place in connection with 424 separate research activities, which resulted in the identification of 492 datasets.
The project had a high level of technical support from QUT High Performance Computing and Research Support Specialists, who developed the Research Data Librarian interface to the data repository. The interface enabled manual entry of interview data and dataset metadata, and the creation of relationships between repository objects. The Research Data Librarians mapped the QUT metadata repository fields to RIF-CS, and the HPC and Research Support Specialists created an application to generate RIF-CS files for harvest by the Australian Research Data Commons (ARDC).
This poster will focus on the workflows and processes established for the project, including:
• Interview processes and instruments
• Data Ingest from existing systems (including mapping to RIF-CS)
• Data entry and the Data Librarian interface to Mediaflux
• Verification processes
• Mapping and creation of RIF-CS for the ARD
The Benefits of Word Embeddings Features for Active Learning in Clinical Information Extraction
This study investigates the use of unsupervised word embeddings and sequence
features for sample representation in an active learning framework built to
extract clinical concepts from clinical free text. The objective is to further
reduce the manual annotation effort while achieving higher effectiveness
compared to a set of baseline features. Unsupervised features are derived from
skip-gram word embeddings and a sequence representation approach. The
comparative performance of unsupervised features and baseline hand-crafted
features in an active learning framework is investigated using a wide range of
selection criteria including least confidence, information diversity,
information density and diversity, and domain knowledge informativeness. Two
clinical datasets are used for evaluation: the i2b2/VA 2010 NLP challenge and
the ShARe/CLEF 2013 eHealth Evaluation Lab. Our results demonstrate significant
improvements in terms of effectiveness as well as annotation effort savings
across both datasets. Using unsupervised features along with baseline features
for sample representation leads to further savings of up to 9% and 10% of the
token and concept annotation rates, respectively.
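Of the selection criteria listed, least confidence is the simplest to sketch: the samples whose best prediction the current model trusts least are sent for annotation first. The function below is a minimal illustration that assumes each unlabelled sample already has a confidence score (the probability of the model's most likely labelling); the function name and the dictionary input are hypothetical and do not come from the study.

```python
def least_confidence_batch(confidences, batch_size):
    """Pick the samples the model is least confident about.

    confidences: dict mapping a sample id to the probability the model
    assigns to its own best prediction for that sample (assumed given).
    Ties are broken by sample id so the selection is deterministic.
    """
    ranked = sorted(confidences, key=lambda sid: (confidences[sid], sid))
    return ranked[:batch_size]


if __name__ == "__main__":
    scores = {"sent-1": 0.93, "sent-2": 0.41, "sent-3": 0.67}
    # The two least confident sentences are selected for annotation.
    print(least_confidence_batch(scores, 2))
```

In an active learning loop, the selected samples would be annotated, added to the training set, and the model retrained before the next round of selection.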
Real, complex, and binary semantic vectors
This paper presents a combined structure for using real, complex, and binary valued vectors for semantic representation. The theory, implementation, and application of this structure are all significant. For the theory underlying quantum interaction, it is important to develop a core set of mathematical operators that describe systems of information, just as core mathematical operators in quantum mechanics are used to describe the behavior of physical systems. The system described in this paper enables us to compare more traditional quantum mechanical models (which use complex state vectors) with more generalized quantum models that use real and binary vectors. The implementation of such a system presents fundamental computational challenges. For large and sometimes sparse datasets, the demands on time and space are different for real, complex, and binary vectors. To accommodate these demands, the Semantic Vectors package has been carefully adapted and can now switch between different number types comparatively seamlessly. This paper describes the key abstract operations in our semantic vector models, and describes the implementations for real, complex, and binary vectors. We also discuss some of the key questions that arise in the field of quantum interaction and informatics, explaining how the wide availability of modelling options for different number fields will help to investigate some of these questions.
Semantic oscillations: encoding context and structure in complex valued holographic vectors
In computational linguistics, information retrieval and applied cognition, words and concepts are often represented as vectors in high-dimensional spaces computed from a corpus of text. These high-dimensional spaces are often referred to as Semantic Spaces. We describe a novel and efficient approach to computing these semantic spaces via the use of complex-valued vector representations. We report on the practical implementation of the proposed method and some associated experiments. We also briefly discuss how the proposed system relates to previous theoretical work in Information Retrieval and Quantum Mechanics and how the notions of probability, logic and geometry are integrated within a single Hilbert space representation. In this sense the proposed system has more general application and gives rise to a variety of opportunities for future research.
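One common way to encode structure in complex-valued holographic vectors is to give every element unit magnitude and a random phase, bind two vectors by element-wise multiplication (which adds their phases), and unbind by multiplying with the complex conjugate. The sketch below illustrates that general formulation only; whether it matches the exact operators used in the paper is an assumption, and all function names are hypothetical.

```python
import cmath
import math
import random


def random_phasor_vector(dim, rng):
    """Vector of unit-magnitude complex numbers with uniformly random phases."""
    return [cmath.exp(1j * rng.uniform(-math.pi, math.pi)) for _ in range(dim)]


def bind(u, v):
    """Bind two vectors by element-wise multiplication (phases add)."""
    return [a * b for a, b in zip(u, v)]


def unbind(bound, v):
    """Invert a binding by multiplying with the conjugate (phases subtract)."""
    return [a * b.conjugate() for a, b in zip(bound, v)]


def similarity(u, v):
    """Mean cosine of the phase differences; 1.0 means identical vectors."""
    return sum((a * b.conjugate()).real for a, b in zip(u, v)) / len(u)


if __name__ == "__main__":
    rng = random.Random(0)
    word = random_phasor_vector(256, rng)
    role = random_phasor_vector(256, rng)
    recovered = unbind(bind(word, role), role)
    print(round(similarity(recovered, word), 3))
```

Because binding only shifts phases, the bound vector keeps the same dimensionality as its inputs, and unrelated random vectors remain close to orthogonal under this similarity measure.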
Predicting sense convergence with distributional semantics: an application to the CogALex-IV 2014 shared task
This paper presents our system to address the CogALex-IV 2014 shared task of identifying a single word most semantically related to a group of 5 words (queries). Our system uses an implementation of a neural language model and identifies the answer word by finding the most semantically similar word representation to the sum of the query representations. It is a fully unsupervised system which learns on around 20% of the UkWaC corpus. It identifies 85 exact targets out of 2,000 queries, and 285 approximate targets in lists of 5 suggestions.
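The retrieval step described above can be sketched as follows: sum the vectors of the query words, then rank the remaining vocabulary by cosine similarity to that sum. The toy embeddings, the function names, and the decision to exclude the query words from the candidates are assumptions made for illustration; the actual system learns its representations with a neural language model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_targets(queries, embeddings, top_n=5):
    """Rank candidate words by similarity to the sum of the query vectors."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in queries:
        # Query words missing from the vocabulary are skipped.
        for i, x in enumerate(embeddings.get(word, [])):
            total[i] += x
    # Excluding the query words themselves is an illustrative assumption.
    candidates = [w for w in embeddings if w not in queries]
    candidates.sort(key=lambda w: cosine(total, embeddings[w]), reverse=True)
    return candidates[:top_n]


if __name__ == "__main__":
    toy = {
        "wheel": [1.0, 0.0],
        "engine": [0.9, 0.1],
        "car": [1.0, 0.05],
        "banana": [0.0, 1.0],
    }
    print(rank_targets(["wheel", "engine"], toy, top_n=1))
```

Returning the top five candidates rather than only the first corresponds to the lists of 5 suggestions used for the approximate-target evaluation.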
Analogical frames by constraint satisfaction
This research develops a new and efficient constraint satisfaction approach to the unsupervised discovery of linguistic analogies. It shows that systems of analogies can be discovered with high confidence in natural language text by a computer program without human input. The discovery of analogies is useful for many applications such as the construction of linguistic resources, natural language processing, and the automation of inference and reasoning.
Clustering with random indexing K-tree and XML structure
This paper describes the approach taken to the clustering task at INEX 2009 by a group at the Queensland University of Technology. The Random Indexing (RI) K-tree has been used with a representation that is based on the semantic markup available in the INEX 2009 Wikipedia collection. The RI K-tree is a scalable approach to clustering large document collections. This approach has produced quality clustering when evaluated using two different methodologies.
Building an Australian user community for VIVO
The Australian National Data Service (ANDS) was established in 2008 and aims to: influence national policy in the area of data management in the Australian research community; inform best practice for the curation of data; and transform the disparate collections of research data around Australia into a cohesive collection of research resources. One high profile ANDS activity is to establish the population of Research Data Australia, a set of web pages describing data collections produced by or relevant to Australian researchers. It is designed to promote visibility of research data collections in search engines, in order to encourage their re-use. As part of activities associated with the Australian National Data Service, an increasing number of Australian Universities are choosing to implement VIVO, not as a platform to profile information about researchers, but as a 'metadata store' platform to profile information about institutional research data sets, both locally and as part of a national data commons. To date, the University of Melbourne, Griffith University, the Queensland University of Technology, and the University of Western Australia have all chosen to implement VIVO, with interest from other Universities growing.